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Abstract 

The general Markov plus invariable sites (GM+I) model of biological sequence evo- 
lution is a two-class model in which an unknown proportion of sites are not allowed 
to change, while the remainder undergo substitutions according to a Markov pro- 
cess on a tree. For statistical use it is important to know if the model is identifiable; 
can both the tree topology and the numerical parameters be determined from a 
joint distribution describing sequences only at the leaves of the tree? We establish 
that for generic parameters both the tree and all numerical parameter values can 
be recovered, up to clearly understood issues of 'label swapping.' The method of 
analysis is algebraic, using phylogenetic invariants to study the variety defined by 
the model. Simple rational formulas, expressed in terms of determinantal ratios, are 
found for recovering numerical parameters describing the invariable sites. 
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1 Introduction 



If a model of biological sequence evolution is to be used for phylogenetic infer- 
ence, it is essential that the model parameters of interest — certainly the tree 
parameter and usually the numerical parameters — be identifiable from the 
joint distribution of states at the leaves of the tree. Though often unstated, the 
assumption that model parameters are identifiable underlies the use of both 
Maximum Likelihood and Bayesian inference methods. As increasingly com- 
plicated models, incorporating across-site rate variation, covarion structure, 
or other types of mixtures, are implemented in software packages, there is a 
real possibility that non-identifiability could confound data analysis. Unfor- 
tunately, our theoretical understanding of this issue lags well behind current 
phylogenetic practice. 

One natural approach to proving the identifiability of the tree topology relies 
on the definition of a phylogenetic distance for the model, and the 4-point 
condition of Buneman [1]. For instance. Steel [2] used the log-det distance 
to establish the identifiability of the tree topology under the general Markov 
model and its submodels. Such a distance-based argument shows additionally 
that 2-marginalizations of the full joint distribution suffice to recover the tree 
parameter, since distances require only two-sequence comparisons. Once the 
tree has been identified, the numerical parameters giving rise to a joint dis- 
tribution for the general Markov model can be determined by an argument of 
Chang [3]. 

However, for more general mixture models and rates-across-sites models no 
appropriate definition of a distance is known, so proving the identifiability of 
the tree parameter requires a different approach. (Though distance measures 
have been developed for GTR models with rate-substitution [4,5], these require 
that one know the rate distribution completely, and identifiability of the rate 
distribution has yet to be addressed. Although identifiability of the popular 
GTR-|-I-|-r model of sequence evolution was considered in [6], there are gaps 
in the argument, as was pointed out to us by Ane [7].) 

In [8], the viewpoint of algebraic geometry is used to show the generic iden- 
tifiability of the tree parameter for the covarion model of [9] and for certain 
mixture models with a small number of classes. Though this result is far more 
general than previous identifiability results, it still fails to cover the type of 
rate-variation models currently in common use for data analysis, and does not 
address identifiability of numerical parameters at all. Much more study of the 
identifiability question is needed. 

In this paper, we focus on the general Markov plus invariable sites, GM-I-I, 
model of sequence evolution, a model that encompasses the GTR-I-I model 
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that is of more immediate interest to practitioners. Note that previous work 
on GM+I by Baake [10] focused on non-identifiabihty. In that paper parame- 
ter choices for the 2-state GM+I model on two distinct 4-taxon trees are con- 
structed that give rise to the same pairwise joint distributions (2-marginals) . 
As both sets of parameters have 50% invariable sites, this shows that the 
identifiability of the tree parameter cannot generally hold on the basis of 
2-sequence comparisons, even if the distribution of rate factors is known. Fur- 
thermore, it shows a well-behaved phylogenetic distance cannot be defined for 
this model, as existence of such a distance would imply tree identifiabihty. 

Here we prove that all parameters for the GM-I-I model are indeed identifiable, 
through 4-sequence comparisons. By identifiable, we mean generically identifi- 
able in a geometric sense: For a fixed tree, the set of numerical parameters for 
which the joint distribution could have arisen from either a) a different tree, 
or b) a 'significantly different' (in a sense to be made clear later) choice of nu- 
merical parameters on the same tree, is of strictly lower dimension than that 
of the full numerical parameter space. (For a concrete example of generic iden- 
tifiability, recall the results of Steel and Chang on the general Markov model: 
assumptions that the Markov edge matrices Mg have determinant 7^ 0, 1 and 
that the distribution of states at the root has strictly positive entries ensure 
identifiability of all parameters. These are generic conditions.) Thus for nat- 
ural probability distributions on the parameter space, with probability one a 
choice of parameters is generic. 

Although identifiability of the tree parameter for GM+I follows from more 
general results in [8], that paper did not consider identifiability of numerical 
parameters. Our arguments here are tailored to GM+I and yield stronger re- 
sults addressing numerical parameters as well as the tree. Our approach is 
again based on the determination of phylogenetic invariants for the model. 
While the invariants described in [8] are invariants for more general models 
than GM+I, the ones given in this paper apply only to GM+I and its sub- 
models, and are of much lower degree. As a byproduct of the development of 
these GM+I invariants, we are led to rational formulas for recovering all the 
parameters related to the invariable sites from a joint distribution. Indeed, 
these formulas are crucial to our identification of numerical parameters. 

These formulas can be viewed as GM+I analogs of the formulas for the pro- 
portion of invariable sites in group-based+I models that were found by the 
capture- recapture argument of [11]. In the group-based setting, those formulas 
were developed into a heuristic means of estimating the proportion of invari- 
able sites from data without performing a full tree inference. This has been 
implemented in SplitsTree4 [12]. However, it remains unclear whether a sim- 
ilar useful heuristic can be found for the formulas presented in this paper. 
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Since our algebraic methods at times employ computational commutative al- 
gebra software packages, and these tool are not commonly used in the phy- 
logenetics literature, we have included some examples of code in Appendix 
A. 



2 The GM+I Model 

Let T denote an n-taxon tree, by which we mean a tree with n leaves labeled 
by the taxa ai, 02, . . . , a„ and all internal vertices of valence at least 3. We say 
T is binary if all internal nodes have valence exactly 3. 

We begin by describing the parameterization of the /t-state GM+1 model of 
sequence evolution along T, where k = A corresponds to usual models of DNA 
evolution. The class size parameter 5 denotes the probability that any par- 
ticular site in a sequence is invariable: conceptually, the flip of a biased coin 
weighted by 6 determines if a site is allowed to undergo state transitions. If 
a site is invariable, it is assigned state i G [k] = {1, 2, . . . , /t} with probabil- 
ity 7r/(i). Here tt/ = (7r/(l), . . . ,7r/(/c)) is a vector of non-negative numbers 
summing to 1 giving the state distribution for invariable sites. 

All sites that are not invariable mutate according to a common set of parame- 
ters for the GM model, though independently of one another. For these sites, 
we associate to each node (including leaves) of T a random variable with state 
space [k\. Choosing any node r of T to serve as a root, and directing all edges 
away from r, let T,. denote the resulting directed tree T. A root distribution 
vector ttqm = i'^Gui'^), ■ ■ ■ with non-negative entries summing to 

1, has entries ttgmU) specifying the probability that the root variable is in 
state j. For each directed edge e = — > w) of T^, let Mg be Markov 
matrix, so that Me{i,j) specifies the conditional probability that the variable 
at w is in state j given that the variable at v is in state i. Thus entries of all 
Mg are non- negative, with rows summing to 1. 

For the GM-I-I model on an n-taxon tree T with edge set E, the stochastic 
parameter space S C [0, 1]^ is of dimension N — 1 + {k — 1) + {k — 1) + 
\E\k,{k, — 1) = 2k — 1 + \E\k{k~1). The parameterization map giving the joint 
distribution of the variables at the leaves of T is denoted by 

SI — > P. 

We view P as an n-dimensional kx- ■ - xn array, with dimensions corresponding 
to the ordered taxa ai, 02, . . . , a^, and with entries indexed by the states at 
the leaves of T. The entries of P are polynomial functions in the parameters 
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s explicitly given by 
P{ii, . . . = 

Se{ii,i2,. . .in)'Ki{ii) + {1 - S) ['^GMijr)]jMe{jvi,jvf)]- (1) 

Here e{ii,i2, ■ ■ - in) is 1 if all ij are equal and otherwise, the product is 
taken over all edges e = {vi Vf) G E, and the sum is taken over the set 
of all possible assignments of states to nodes of T extending the assignment 
(ii, . . . , in) to the leaves: If V is the set of vertices of T then 

— {Uv) £ [/^l'^' I jv = ife if is a leaf labeled by Ofej . 

For notational ease, the entries of P, the pattern frequencies, are also denoted 
byPn...i„ = P{ii,...,in)- 

We note that while a root r was chosen for the tree in order to explicitly 
describe the GM portion of the parameterization of our model, the particular 
choice of r is not important. Under mild additional restrictions on model pa- 
rameters, changing the root location corresponds to a simple invertible change 
of variables in the parameterization. (See [13], [14], or [15] for details.) This 
justifies our slight abuse of language in referring to the GM or GM+I model 
on T, rather than on Tr, and we omit future references to root location. 

Note that equation (1) allows us to more succinctly describe any P e Im(0T) 
as 

P^{1-S)Pgm + SPi (2) 

where Pgm is an array in the image of the GM parameterization map on T 
and Pj = diag(7r/) is an n-dimensional array whose off-diagonal entries are 
zeros and whose diagonal entries are those of ttj. 



3 Model Identifiability 



We now make precise the various concepts of identifiability of a phylogenetic 
model. To adapt standard statistical language to the phylogenetic setting, for 
a fixed set A of n taxa and k. > 2, consider a collection of pairs (T, 
where T is an n-taxon tree with leaf labels A, and 0t : •S't — * [0, l]""" is 
a parameterization map of the joint distribution of pattern frequencies for 
the model on T. We say the tree parameter is identifiable for Ai if for every 
P G ^{T,^T)eM Iiii(0t), there is a unique T such that P G lm(0T)- We say that 
numerical parameters are identifiable on a tree T if the map (f>T is injective, 
that is if for every P G lm{(f)T) there is a unique s E St with (j)T{s) — P. We 
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say the model M. is identifiable if the tree parameter is identifiable, and for 
each tree the numerical parameters are identifiable. 

It is well-known that such a definition of identifiability is too stringent for 
phylogenetics. First, unless one restricts parameter spaces, there is little hope 
that the tree parameter be identifiable: One need only think of any standard 
model on a binary 4-taxon tree in which the Markov matrix parameter on 
the internal edge is the identity matrix. Any joint distribution arising from 
such a parameter choice could have as well arisen from any other 4-taxon tree 
topology. 

Even if such 'special' parameter choices are excluded so the tree parameter be- 
comes identifiable, identifiability of numerical parameters also poses problems, 
as noted by Chang [3] . For example, consider the 3-taxon tree with the GM 
model. Then multiple parameter choices give rise to the same joint distribu- 
tion since the labeling of the states at the internal node can be permuted in k\ 
ways, as long as the Markov matrix parameters are adjusted accordingly [14]. 
The occurrence of this sort of 'label-swapping' non-identifiability in statistical 
models with hidden (unobserved) variables is well-known, but is not of great 
concern. However, even for this model more subtle forms of non-identifiability 
can occur, in which infinitely many parameter choices lead to the same joint 
distribution. These arise from singularities in the model, and can be avoided 
by again restricting parameter space. Such 'generic' conditions for the GM 
model have already been mentioned in the introduction. 

We therefore refine our notions of identifiability. Because we are concerned 
primarily with model where the maps 0r are given by polynomials, we give a 
formulation appropriate to that setting. Recall that given any collection T of 
polynomials in N variables, their common zero set, 

= e I f{z) = for aU / e J^}, 

is the algebraic variety defined by JF. If the algebraic variety is a proper subset 
of C^, then it is said to be proper. 

Definition 1 Let Ai be a model on a collection of n-taxon trees, as defined 
above. 

(1) We say the tree parameter is generically identifiable for M. if for each 
tree T there exists a proper algebraic variety Xt with the property that 

P E [J 0t('S't \ Xt) implies P & (I)t{St ^ Xt) for a unique T. 

(2) We say that numerical parameters are generically locally identifiable on a 
tree T if there is a proper algebraic variety Yt such that for all s G St^Yt, 
there is a neighborhood of s on which (pT is injective. 
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(3) We say the model Ai is gencrically locally identifiable if the tree parame- 
ter is generically identifiable, and for each tree the numerical parameters 
are generically locally identifiable. 

Note that the notion of 'generic' here is used to mean 'for all parameters 
but those lying on a proper subvariety of the parameter space,' and such a 
variety is necessarily of lower dimension than the full parameter space. Using 
the standard measure on the parameter space, viewed as a subset of M^, this 
notion thus also implies 'for all parameters except those in a set of measure 
0.' 

In the important special case of parameterization maps defined by polynomial 
formulas, such as that for the GM+I model, generic local identifiability of 
numerical parameters is equivalent to the notion in algebraic geometry of the 
map (f)T being generically finite. In this case, there exists a proper variety Yx 
and an integer k, the degree of the map (pT, such that restricted to St ^ Yt 
the map (pT is not only locally injective but also A;-to-l: That is, if s G 5^ \ 1^ 
and P = 0t(s), then the fiber (^'^^{P) has cardinality k. 

Because of the label swapping issue at internal nodes, for the GM model and 
GM+I on an n-taxon tree T with vertex set V , fibers of generic points will 
always have cardinality at least — n). Thus for these models, the best 

we can hope for is generic local identifiability of the model (both tree and 
numerical parameters) where the generic fiber has exactly this cardinality. 
That in fact is what we establish in the next section. 



4 Generic Identifiability for the GM+I model 

Wc begin our arguments by determining some phylogenctic invariants for the 
GM+I model. The notion of a phylogenetic invariant was introduced by Caven- 
der and Felsenstein [16] and Lake [17], in the hope that phylogenetic invariants 
might be useful for practical tree inference. Their role here, in proving identi- 
fiability, is more theoretical but illustrates their value in analyzing models. 

For a parameterization 0t given by polynomial formulas on domain St C R^, 
we may uniquely extend to a polynomial map with domain C^, given by the 
same polynomial formulas, which we again denote by 0r : — > C". 

Remark 2 Extending parameters to include complex values is solely for math- 
ematical convenience, as algebraic geometry provides the natural setting for 
our viewpoint. The collection of stochastic joint distributions (arising from 
the original stochastic parameter space) is a proper subset oflm{(j)T)- 
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The phylogenetic variety, Vt, is the the smallest algebraic variety in con- 
taining 0r(C^), i.e., the closure of the image of (f}T under the Zariski topology, 

Vt = Im(0T) C C^". 

Remark 3 Vr coincides with the closure oflm{(fiT) = 0t(C^) under the usual 
topology on . However, while Vr H [0,1]'^" contains the closure of c/)t{St) 
under the usual topology, these need not be equal. 

Let C[P] denote the ring of polynomials in the in determinates {pii...i„}- 
Then the collection of all polynomials in C[P] vanishing on Vr forms a prime 
ideal It- We refer to It as a phylogenetic ideal, and its elements as phylogenetic 
invariants. More explicitly, a polynomial / G C[P] is a phylogenetic invariant 
if, and only if, /(Pq) = for every Pq G (/)t{C'^"), or equivalently, if, and only 
if, /(Po) = Ofor every PoeMSr). 



As we proceed, we consider first the special case of 4-taxon trees. We highlight 
the K = 2 case, in part to illustrate the arguments for general k more clearly, 
and in part because we can go further in understanding the 2-state model. 

Consider the 4-taxon binary tree Tab\cd, with taxa a, b, c, d as shown in Figure 
1. 



Fig. 1. The 4-taxon tree Tg^b\cd 

Suppose that Pisa2x2x2x2 pattern frequency array, whose indices 
correspond to states [2] = {1,2} at the taxa in alphabetical order. Then the 
internal edge e of T defines the split ab \ cd in the tree, and we define the edge 
flattening of P at e, a 2^ x 2^ matrix, by 

^PUU PU12 P1121 Pll22^ 
Pl2U Pl212 Pl221 Pl222 
P2III P2II2 P212I P2122 
\P22U P2212 P222I P2222I 



(3) 



Notice that the rows of F^. are indexed by the states at {ah} and the columns 
by states at {cd}. The flattening F^ is intuitively motivated by considering a 
'collapsed' model induced by e: taxa a and b are grouped together forming 
a single variable {ab} with 4 states, and the grouping {cd} forms a second 
variable with 4 states. 
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This construction can be generalized in a natural way: suppose T is an n-taxon 
tree, and P a k x ■ ■ ■ x k array with indices corresponding to the taxa labeling 
the leaves of T. Then for any edge e in T, we can form from P the matrix Fg 
of size X K^'^, where rii and n2 are the cardinahties of the two sets of taxa 
in the split induced by e. 



Prom [15] (for a more expository presentation, see also [18]), we have: 

Theorem 4 For the 2- state GM model on a binary n-taxon tree T , the phy- 
logenetic ideal It is generated by all3 x 3 minors of all edge flattenings Fg of 
P. Moreover, for the K-state GM model on an n-taxon tree T, the phylogenetic 
ideal It contains all {k-\- 1) x {k-\- 1) minors of all edge flattenings of P. 



Using this result, we can deduce some elements of the phylogenetic ideal for 
the GM+I model for any number of taxa n > 4 and any number of states 

K>2. 



Proposition 5 (Phylogenetic Invariants for GM+I) 

(1) For the A-taxon tree Ta,b\cd o-nd the 2-state GM+I model, the cubic deter- 
minantal polynomials 



P1112 P1121 P1122 
Pl212 Pl221 Pl222 
P2II2 P212I P2122 



and /2 = 



P12II />1212 />1221 

P2111 P2112 P2121 
P2211 P2212 P2221 



are phylogenetic invariants. These are the two 3x3 minors of the matrix 
flattening Fab\cd of equation (3) that do not involve either of the entries 
Pun or P2222- 

(2) More generally, for n > 4 and n > 2, consider the n-state GM+I model 
on an n-taxon tree T . Then for each edge e of T , all [k + 1) x {k + 1) 
minors of the flattening Fg of P that avoid all entries Pn...i, i £ [k] are 
phylogenetic invariants. 



PROOF. We prove the first statement in detail. From equation (2), for any 
P = 4>t{s) we have P = (1 — 5)Pgm + ^-P/, where Pgm is a 4-dimensional 
table arising from the GM model on T and Pj = diag(7r/) is a diagonal table 
with entries giving the distribution of states for the invariable sites. Flattening 
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these tables with respect to the internal edge of the tree, we obtain 



ab\cd 



{1-S)Fgm + SFi 



[1-5) 



P12II P1212 P122I P1222 
P2III P2II2 P212I P2122 
\P22il P2212 P222I P2222 J 



+ 5 



\ 





\ ^ 7r7(2)y 



(4) 



By Theorem 4, all 3 x 3 minors of Fqm vanish. Since the 'upper right' and 
'lower left' minors of F„b\cd are the same as those of Fqmi up to a factor of 
(1 — 5)^, they also vanish. 



Straightforward modifications to this argument give the general case. □ 



For arbitrary n, re, the GM+I model should have many other invariants than 
those found here. Among these is, of course, the stochastic invariant 

fs{p) = 1 - E ^^i- 



In the simplest interesting case of the GM+I model, however, we have the 
following computational result. 

Proposition 6 The phylogenetic ideal for the 2-state GM+I model on the 
A-taxon tree Tab\cd of Figure 1 is generated by fg and the minors fi, f^ above; 

It = {fs, fi, /2)- 



PROOF. A computation of the Jacobian of the parameterization <pT '■ S C 
<C}^ — * shows it has full rank at some points, and so Vt is of dimension 13. 
If / = (/s, /i, /2), then / C /y. Another computation shows that / is prime and 
of dimension 13. Thus, necessarily / = It- (The code for these computations 
is given in Appendix A.) □ 



Let Vab\cdi yac\bc-, Kd|bc be the varieties for the 2-state GM+1 models for the 
three 4-taxon binary tree topologies, with corresponding phylogenetic ideals 
Iah\cd-i Iac\bd-i Iad\bc- Of coursc Proposition 6 gives generators for each of these 
ideals — two 3x3 minors of the flattenings of P appropriate to those tree 
topologies, along with fg. A computation (see Appendix A) shows that these 
three ideals are distinct. Therefore the three varieties are distinct, and their 
pairwise intersections are proper subvarieties. Thus for any parameters s not 
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lying in the inverse image of these subvarieties, T is uniquely determined from 
(pris)- Thus we obtain 

Corollary 7 For the 2-state GM+I model on binary 4-taxon trees, the tree 
parameter is generically identifiable. 

As dim{Vab\cd) = 13, and the parameter space for 4>t is 13 dimensional, we 
also immediately obtain that the map (pT is generically finite. This yields 

Corollary 8 For the 2-state GM+I model on a binary 4-taxon tree, numerical 
parameters are generically locally identifiable. 

Note that this does approach does not yield the cardinality of the generic fiber 
of the parameterization map, which is also of interest. We will return to this 
issue in Theorem 13. 

Further computations show that dim(Kb|cd H Kc|6d H Vad\bc) = H- As this 
intersection contains all points arising from the GM+I model on the 4-taxon 
star tree, which is an 11-parameter model, this is not surprising. In fact, one 
can verify computationally that the ideal Iab\cd + Iac\bd + Iad\bc is the defining 
prime ideal of the star-tree variety. We also note that the ideal Iab\cd + Iac\bd 
decomposes into two primes, both of dimension 11. Thus the variety defined 
by this ideal has two components, one of which is the variety for the star tree. 

In principle, the ideal It of all invariants for the GM+I model on an arbitrary 
tree T can be computed from the parameterization map 0t via an elimination 
of variables using Grobner bases [19]. However, if all invariants for the ft:-state 
GM model on T are known, they can provide an alternate approach to finding 
It which, while still proceeding by elimination, should be less computationally 
demanding. 

To present this most simply, we note that because our varieties lie in the 
hyperplane described by the stochastic invariant, it is natural to consider 
their projectivizations, lying in P'^"^! rather than C". The corresponding 
phylogenetic ideals, which we denote by Jt, are generated by the homogeneous 
polynomials in It, and do not contain the stochastic invariant. Conversely, It 
is generated by the elements of Jt together with the stochastic invariant. 

In addition, we need not restrict ourselves to the GM model, but rather deal 
with any phylogenetic model parameterized by polynomials. 

Proposition 9 Suppose (pT '■ C*" is a parameterization map for some 

phylogenetic model M. on T, with corresponding homogeneous phylogenetic 
ideal Jt. Let 

(f)T : X C ^ 
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be the parametrization map for the M. +1 model given by 

(/)t(s, {6, TT/)) = (1 - 6)(j)T{s) + 6 diag(7r7). 

Let P' denote the collection of all indeterminate entries of P except those in 
Peq = {Pii...i \ ^ ^ i'A} ■ Then the homogeneous phylogenetic ideal Jt for the 
M.+I model on T is Jt = {3t H C[P']^ C[P]. Thus Jt can be computed from 
Jt by elimination of the variables in P^q. 

PROOF. Extend the parameterization maps (I)t, (t>T to parameterizations of 
cones by introducing an additional parameter, 

$T(s,t) =t0T(s) 
$t(s, (5, TT/) ,t)^t (^t(s, {S, TTi)) 

Then Im($T) = C*^ x proj(Im($T)), where corresponds to coordinates in 
Peg and 'proj' denotes the projection map from P-coordinates to P'-coordinates 
As Jt is the ideal of polynomials vanishing on Im($r), and Jt H C[P'] the 
ideal vanishing on proj(Im($7^)), the result follows. □ 

Using this, in the appendix we give an alternate computation to show both 
part (1) of Proposition 5, and Proposition 6. While this computation is quite 
fast, a more naive attempt to find GM+I invariants directly from the full 
parameterization map using elimination was unsuccessful, demonstrating the 
utility of the proposition. Moreover, we can use this proposition to compute 
all 2-state GM+I invariants on the 5-taxon binary tree as well. This leads us 
to 

Conjecture 10 On an n-taxon binary tree, the ideal of homogeneous invari- 
ants for the 2-state GM+I model is generated by those 3x3 minors of edge 
fiattenings that do not involve the variables pii,„i andp22...2, together with the 
stochastic invariant. 

Although we are unable to determine all GM+I invariants for the 4-taxon tree 
for general k, using only those described in Proposition 5 we can still obtain 
identifiability results through a modified argument. 

Proposition 11 For the K-state GM -hi model on binary 4-taxon trees, k>2, 
the tree parameter is generically identifiable. 

PROOF. By the argument leading to Corollary 7, it is enough to show the 
varieties Vab\cd-i Kic|6d) a-nd Vad\bc a-re distinct. Considering, for example, the first 
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two, we can show that the varieties Vab\cd and Vac\bd arc distinct, by giving an 
invariant / e Iac\bd and a point Pq e Vab\cd such that /(Pq) 0. 

Using Proposition 5, we pick an invariant / G Iac\bd as follows: In the flattening 
Fac\bd according to the split ac\bd, choose any collection of k+1 ac-indices with 
distinct a and c states, e.g., {12, 13, ... , Ik, 21, 23}. Using the same set as bd- 
indices, this determines a (k + 1) x (ac + l)-minor /. 

We pick Pq — 0r^^|^^(s) using the parameterization of equation (1) by making 
a specific choice of parameters s. On Tab\cd, with the root r located at one of 
the internal nodes, choose parameters s as foUows: Let ttgm, tt/ be arbitrary 
but with all entries of ttgm positive. Pick any S e [0, 1). For the four terminal 
edges choose Mg to be the k, x k identity matrix 1^. For the single internal 
edge e of T, choose any Markov matrix Me with all positive entries. For such 
parameters, the entries of the joint distribution Po = 0T^i,|cd(^) ^^^^ except 
for the pattern frequencies pajj, where the states at the leaves a and b agree 
and the states at the leaves c and d agree. Since the entries of Mg and the root 
distributions are positive, each of the pujj > 0. 

But considering the flattening Fac\bd of Pq = (j^r^^^^is) with respect to the 
'wrong' topology Tac\bd, we observe that the non-zero entries pujj of Fac\bd 
all lie on the diagonal of Fac\bd, iu the positions with ij as both ac-index and 
6(i-index. Furthermore, by our choice of /, a subset of them forms the diagonal 
of the submatrix whose determinant is /. Therefore /(Pq) 7^ 0. □ 



Proposition 12 (Recovery of invariable site parameters) 



(1) For the 4-taxon tree Tab\cd (ind the 2-state GM+I model, suppose P ~ 
0r(s). Then generically the parameters in s related to invariable sites 
can be recovered from P by the following formulas: 



l^ll 


+ 






IPI 





1 



(l^lU^2|), 



where B — 



P1212 P1221 
\P2u2 P2121 J 

/ 



Pun P1112 P1121 
P1211 P1212 P1221 

^^P2111 P2II2 P212IJ 



A, 



I \ 

P\2\2 P\22\ P1222 
P2II2 P212I P2122 
^^P2212 P222I P2222y 



(2) More generally, for the n-state GM+I model on Tab\cd, the invariable site 
parameters can be recovered from a generic point in the image of the 
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parameterization map by rational formulas of the form 









B 





{\A^l\A2l...,\A^\). 



Here \B\ is any k,x k minor of Fab\ai that omits the all rows and columns 
indexed by ii, and \Ai\ is the x {n + 1) minor obtained by including 

all rows and columns chosen for B and in addition the ii row and ii 
column. 



PROOF. We give the complete argument in the case k — 2 first. For a joint 
distribution P e Im(0T), write Fah\cd — 0- ~ S)Fgm + ^Fj as in equation (4). 
Since Ai is the 'upper left' 3x3 submatrix of Fah\cd, using linearity properties 
of the determinant, and that all 3 x 3 minors of Fqm evaluate to zero, we 
observe that 



l^il 



(1 - 5f 



Pun P1112 P1121 

P1211 P1212 P1221 
P2111 P2112 P2121 



+ 



S-Kiil) 

(1 - (5)pi212 (1 - (^)Pl221 
(1 - 5)P2112 (1 - 5)P2121 



(1 - (J)pi212 (1 - ^)Pl221 
(1 - 5)P2112 (1 - (^)P2121 



Thus we have \Ai\ = 67ri{l)\B\. Now, if \B\ 7^ 0, then 

l^il 



\B\ 



As \B\ does not vanish on all of V^, we have a rational formula to compute 
(57r/(l) for generic points on Vr- 

Similarly, since A2 is the 'lower right' submatrix of Fab\cd, then 

1^1 



\B\ 



Adding these together, we obtain the stated rational expression for S. 
Assuming additionally the generic condition that S ^ 0, then we find 



TT/ 





Ai 






+ 


A2I 





A2 




1^ 


+ 


A2\ 



Thus the parameters S, tt/ are generically identifiable for GM+I on T. 
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One readily sees the argument above can be modified for arbitrary k. □ 



Note that when n > 2 the above proposition gives many alternative rational 
formulas for the invariable site parameters, as there are many options for 
choosing the matrix B. 

We now obtain our main result. 

Theorem 13 The n-state GM+I model on n-taxon binary trees, with n > 4, 
K >2, is generically locally identifiable. Furthermore, for an n-taxon tree with 
V vertices, the fibers of generic points of Vr under the parametrization map 
have cardinality k.\{\V\ —n). Thus for generic points, label swapping at internal 
nodes is the only source of non-identifiability. 



PROOF. Suppose T is an n-taxon tree with P = 0t(s). Choose some subset 

of 4 taxa, say {a, b, c, d}, and suppose the induced quartet tree is Tab\cd- Then 
Pabcd, the 4-marginalization of P, is easily seen to be of the form Pabcd = 
4>T^b\cdi^abcd) where Sabcd = fi'(s) and g is a surjective polynomial function. But 
the tree Tab\cd is generically identifiable by Proposition 11, and thus invariable 
site parameters in Sabcd ^-re generically identifiable by Proposition 12. As these 
coincide with the invariable site parameters in s, and generic conditions on Sabcd 
imply generic conditions on s, the invariable site parameters are generically 
identifiable for the full n-taxon model. 

As an n-taxon binary tree topology is determined by the collection of all 
induced quartet tree topologies, one can now see that T is generically identifi- 
able. Alternately, using the identified invariable site parameters, and assuming 
the additional generic condition that S ^ 1, note that 

is a joint distribution arising from general Markov parameters. Thus generic 
identifiability of the tree can also by obtained from Steel's result for the GM 
model [2] applied to Pqm- 

The generic identifiability of the remaining numerical parameters follows from 
Chang's argument [3] applied to Pgm- Chang's approach also indicates the 
cardinality of the generic fiber is — n) due to the label swapping phe- 

nomenon. □ 
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5 Estimating Invariable Sites Parameters 



The concrete result in Proposition 12 gives explicit rational formulas for re- 
covering parameters relating to invariable sites from the joint distribution. 
These can be viewed as generalizations of the formulas found in [1 1] for group- 
based models. As [11] develops the group-based model formulas into a heuristic 
means of estimating the invariable site parameters from data without perform- 
ing a full Maximum Likelihood fit of data to a tree under a M.+1 model, one 
might suspect the formulas of Proposition 12 could be used similarly without 
the need to assume Ai was group-based, or approximately group-based. We 
emphasize that however useful such an estimate might be, it would not be 
intended to replace a more statistical but time-consuming computation, such 
as obtaining the Maximum Likelihood estimates for these parameters. 

However, it is by no means obvious how to use these formulas well even for 
a heuristic estimate. First, for a 4-taxon tree we have many choices for the 
matrix B, in fact 



of them, so even for k — A, there are 245,025 basic sets of the formulae. 
Moreover, while these simple formulae emerged from our method of proof, 
one could in fact modify them by adding to any of them a rational function 
whose numerator is a phylogcnctic invariant for the GM+1 model, and whose 
denominator is not. Since the invariant vanishes on any joint distribution 
arising from the model, the resulting formulae will still recover invariable site 
information for generic parameters. Thus there are actually infinitely many 
formulas for recovering invariable site parameters. 

One can nonetheless consider simple averaging schemes using only the basic 
formulas of Proposition 12 and find that on simulated data they perform quite 
well at approximately recovering invariable site parameters from empirical 
distributions. However, averaging the large number of formulas give here, and 
then also averaging over a large sample of quartets, as is proposed in [11], 
is more time consuming than one might wish for a fast heuristic. Moreover, 
one must be aware that the denominator in these formulas may vanish on an 
empirical distribution — it is certain to be non-zero only for true distributions 
for GM-I-I arising from generic parameters. 

Nonetheless, it would be of interest to develop versions of these formulas with 
good statistical estimation properties, as the GM-I-I model encompasses mod- 
els such as the GTR+1 model which is often preferred in biological data analy- 
sis to group-based-|-l models. Of course addressing more general rate- variation 
models would be even more desirable, though our results here are not sufficient 
for that. 
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A Code for Computational Algebra Software 



The following code is also available on the authors' websites. 



A.l Computation for Proposition 6 



To show the variety has dimension 13, we execute the following Maple code: 

pa := Matrlx([[p,l-p]]); Mae : = Matrix( [ [l-a,a] , [r , 1-r] ] ) ; 
Meb := Matrix([[l-b,b] , [s,l-s]]); Mef : = Matrix( [ [1-e ,e] , [t , 1-t] ] ) ; 
Mfc := Matrix([[l-c,c] , [u,l-u]]); Mfd := Matrix ([ [l-d,d] , [v, 1-v] ]) ; 
P := ArrayCl. .2,1. .2,1. .2,1. .2) ; 

for i from 1 to 2 do for j from 1 to 2 do for k from 1 to 2 do for 1 from 1 to 2 do 
P[i, j ,k,l] :=0; 

for m from 1 to 2 do for n from 1 to 2 do 

P[i,j ,k,l] :=P[i,j ,k,l]+pa[l,i] *Mae [i ,m] *Meb[m, j] *M6f [m,n] *Mf c [n,k] *Mf d[n,l] ; 
od;od; 

P[i,j,k,l] : = (l-w)*P[i,j,k,l] ; 
od;od;od;od; 

P[l, 1,1,1] :=P[l,l,l,l]+w*q: P[2,2,2,2] :=P[2,2,2,2]+w*(l-q) : 
Q:=ListTools [Flatten] (convert (P.listlist) ) : 
J:=VectorCalculus[Jacoblan] (Q, [a,b,c,d,e,r,s,t,u,v,p,q,w]) : 

K:=subs({a=l/3,b=l/5,c=l/7,d=l/ll,e=l/13,r=l/17,s=l/19,t=l/23,u=l/29,v=l/31, 

p=l/3,q=l/5,w=l/7},J): 

LinearAlgebra[Rank] (K) ; 

Using Singular [20], we complete the proof: 

LIB "matrix. lib" ; LIB "primdec . lib" ; 

ring r = 0, (p0,pl,p2,p3,p4,p5,p6,p7,p8,p9,pl0,pll,pl2,pl3,pl4,pl5) ,dp; 

// Define matrix flattening F_{ab I cd} and polys fs, fl, f2 

matrix Fab [4] [4] =pO,pl ,p2,p3 ,p4,p5,p6,p7,p8,p9,pl0,pll ,pl2 ,pl3,pl4,pl5; 

matrix UR [3] [3] =submat (Fab, 1 . . 3, 2 . .4) ; matrix LL [3] [3] =submat (Fab , 2 . . 4 , 1 . . 3) ; 

poly fl=det(UR); poly f2=det(LL); 

poly fs = p0+pl+p2+p3+p4+p5+p6+p7+p8+p9+pl0+pll+pl2+pl3+pl4+pl5-l; 
ideal I = fs,fl,f2; // define ideal I 
dim(std(I)); // compute dimension of r/I 

primdecGTZ(I) ; // compute primary decomposition of I to show prime 



A . 2 Computation for intersections of K&icd, Kc|6d, Vad\bc 



Continuing the Singular session above, we execute the following: 

/* Define ideals lac, lad corresponding to two alternative tree 

topologies for 4-taxon trees. (So, I = lab in this notation.) */ 

// Flattening for ac I bd split 

matrix Fac[4] [4] =pO,pl ,p4,p5 ,p2 ,p3 ,p6 ,p7,p8,p9 ,pl2 ,pl3,pl0 ,pll ,pl4,pl5 ; 
poly f3=det(submat(Fac,l. .3,2. .4)) ; poly f 4=det (submat (Fac ,2 . .4, 1 . .3) ) ; 
Ideal lac = fs,f3,f4; 
// Flattening for ad I be split 

matrix Fad [4] [4] =p0,p2,p4,p6 ,pl ,p3,p5,p7,p8,pl0,pl2 ,pl4,p9,pll ,pl3,pl5 ; 
poly f 5=det (submat (Fad, 1. .3,2. .4)) ; poly f6=det(submat(Fad,2. .4,1. .3)) ; 
ideal lad = fs,f5,f6; 

reduce(f l,std(Iac)) ; // non-zero answer shows fl not in lac 
reduce (lac, std(I) ) ; // non-zero shows f3,f4 not in I 
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ideal J = I, lac; diiii(std( J) ) ; // show dim is 11 
ideal K = J, lad; dim(std(K) ) ; // show dim is 11 

primdecGTZ(K) ; // show K prime, and thus ideal for star tree 



A. 3 Computation of 2- state GM+I ideal, 4-taxon trees, using Proposition 9 



The following Singular code performs the needed elimination for a binary tree: 

ideal Igm = minor (Fab , 3) ; 

// Eliminate the 'diagonal' variables 

ideal Igmi = eliml(Igm,pO*pl5) ; 

For the star tree, the 2-state GM ideal is known from [15]. Thus elimination 
can be used to find GM+I invariants. We also show this result agrees with K 
above. 

ideal Igm = minor (Fab, 3) .minor (Fac ,3) , minor (Fad, 3) ; 
// Eliminate the 'diagonal' variables 
ideal Igmi = eliml (Igm,p0*pl5) , f s ; 

reduce (K, std(Igmi) ) ; // all O's indicates ideal containment 
reduce ( Igmi, std (K) ) ; // all O's indicates ideal containment 
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